Abstract: The performance of a parallel algorithm on a very large scale grid is significantly influenced by the underlying Internet protocols and inter-connectivity. Many grid programming platforms use TCP for its reliability, usually with some optimizations to reduce its costs. However, TCP does not perform well in a high-bandwidth, high-delay network environment. UDP, on the other hand, is the fastest protocol available because it omits connection setup, acknowledgments and retransmissions, sacrificing reliable transfer. Many new bulk data transfer schemes that use UDP for data transmission, such as RBUDP, Tsunami and SABUL, have been introduced and shown to perform better than TCP. In this paper, we consider the use of UDP and examine the relationship between packet loss and speedup with respect to the number of grid nodes. Our measurements suggest that average packet loss rates between 5% and 15% are not uncommon between PlanetLab nodes that are widely distributed over the Internet. We show that transmitting multiple copies of the same packet produces higher speedup. Using a BSP-based model, we show the minimum number of packet copies required to maximize the possible speedup for a given number of nodes. Our work demonstrates that, by using an appropriate number of packet copies, we can increase the performance of a parallel program.

Index Terms: Modeling and prediction, probabilistic computation, parallelism, UDP, performance analysis, parallel algorithm complexity.



I. Introduction 

PARALLEL computing has become increasingly popular due to the widespread availability of cost-effective computational resources, such as commodity SMPs, PCs and high-performance cluster platforms. The size of individual clusters has also continued to grow, as evidenced by data collected from the top 500 supercomputers [1]. This has benefited large-scale application research that was once accessible only to a relatively small number of researchers.

However, as computational grids continue to grow into very large scale grids (VLSG), the number of wide area network (WAN) connections between islands of clusters and other high performance computing (HPC) centers grows quadratically with the number of nodes. These WAN connections limit the granularity of parallel applications that could otherwise benefit from the available computing power; that is, the computation-to-communication ratio needs to be large enough that communication complexity does not dominate the run time. Embarrassingly parallel, data-parallel and parametric problems that do not require significant message passing can be parallelized efficiently, but problems that require significant communication present challenges. It is important to understand how these problems can be parallelized efficiently. The approach we consider in this paper is to understand the effect of the WAN connections by examining the relationship of network bandwidth, delay and packet loss with speedup.

Transmission Control Protocol (TCP) [2] and User Datagram Protocol (UDP) [3] are currently the predominant protocols used for end-to-end communication. TCP provides its applications with useful services: connection-oriented, streaming, full-duplex, reliable, end-to-end semantics. These services provide reliability at a cost, namely delay in transmission. Some of them can be sacrificed, depending on the type of application executed over the WAN. Typical grid programming platforms use TCP, or TCP with some optimization, but TCP does not perform well in a high-bandwidth, high-delay network environment [4], [5].

As network bandwidth increases rapidly, together with the advent of new routing and switching technology such as Multi-protocol Label Switching (MPLS), the load on end systems and the data transfer protocols are becoming the bottlenecks in many cases [6]. This indicates that the performance of parallel programs on a WAN, which is mainly constrained by the communication phase, is no longer limited by hardware. We therefore focus our study on the bottleneck caused by the data transfer protocol, assuming that end systems with manageable load are used for computing on the WAN.

TCP's congestion control algorithm (exponential back-off) causes packet transfer throughput to collapse even when bandwidth is still plentiful. It is important to realize that packet losses do not necessarily happen because of network congestion. It is well known that TCP was originally designed for reliable data communication on low-bandwidth, high-error-rate networks [2]. A packet is retransmitted when it is corrupted or lost. The reliability provided by TCP reduces network throughput, increases average delay and worsens delay jitter [7], [8]. While reliable transmission is often critical for the proper execution of a parallel process, TCP is not the only means of attaining reliability, and it is not clear whether TCP algorithms in general are the correct approach for the communication patterns of parallel processes. It is generally accepted that TCP is not suitable for delay-constrained applications that emphasize performance.

UDP, on the other hand, tends to be the fastest protocol because it omits connection setup, acknowledgments and retransmission. Each packet sent is independent of all other packets. However, unlike with TCP, packet losses can occur, and a mechanism has to be provided to detect losses and assure successful delivery of packets [3].

Many high performance data transfer protocols have recently been developed using UDP. The Tsunami [9] reliable file transfer protocol was developed for high-bandwidth dedicated research networks that never experience significant congestion. It consists of two user-space applications (a client and a server) and uses UDP for data transfer and TCP for control information (such as retransmission requests, restart requests, error reports and completion reports). This protocol uses inter-packet delay as its means of flow control, as opposed to the sliding-window mechanism in TCP. Another protocol, Reliable Blast UDP (RBUDP) [10], is an aggressive bulk data transfer scheme. It is intended for extremely high bandwidth, dedicated or Quality-of-Service enabled networks such as optically switched networks. This scheme sends the entire payload at a user-specified sending rate using UDP datagrams; TCP is used to send a signal from the sender indicating the end of transmission and an acknowledgment from the receiver consisting of a bitmap tally of the received packets. This work also demonstrates that the load on the receiving node contributes to the packet loss rate. Simple Available Bandwidth Utilization Library (SABUL) [6] is another high performance data transfer protocol for data-intensive applications over high-bandwidth networks. This reliable and lightweight application protocol uses UDP for data transfer and TCP for feedback control messages. It also uses a rate-based congestion control mechanism that tunes the inter-packet time, as in Tsunami. All this work shows that UDP with some added reliability can enhance performance for WAN-based applications that require immense data movement. Thus, we investigate the performance of UDP for parallel computing over a WAN.

Our work considers the possibility of achieving good parallel processing speedup using UDP on a WAN, with some reliability provided by acknowledgment packets. We investigate the effect of packet loss on speedup and present a variant of the Bulk Synchronous Parallel (BSP) processing model that treats packet loss as a fundamental parameter. We do so because we believe packet loss is the main factor in the performance of UDP-based transmission, since UDP is an unreliable protocol. To counter the unreliability of UDP, we introduce an acknowledgment packet from the receiver. Thus a sending node involved in the communication phase knows whether a packet has been lost (i.e. not received by the receiving node).

A. UDP measurements on PlanetLab 

To obtain a realistic view of packet loss, bandwidth and round-trip time on a very large scale grid, we measured UDP behavior for different packet sizes between randomly selected nodes in the ".edu" top-level domain within PlanetLab [11]. We used utility programs and scripts that we developed to select pairs of nodes randomly from almost 160 ".edu" nodes. These nodes were then used to measure end-to-end packet loss, round-trip time and bandwidth using UDP. One hundred pairs of nodes were used in the experiment, and the pairs were run one at a time. Average packet losses between 5% and 15% were registered on this platform, as plotted in Fig. 1. It is also interesting to note that the percentage of packet loss is independent of packet size up to 10 Kbytes, where the loss is less than 10%, and increases slightly to about 15% for larger packet sizes. There are cases in which packet loss exceeds 15%; this is probably due to high load and physical links on end systems. Bandwidth and round-trip time were also measured for UDP on PlanetLab, as depicted in Fig. 2 and Fig. 3 respectively. We observed that bandwidth of SOMbytes per second to SOMbytes per second on average can be achieved using UDP. Average round-trip times between 0.05 s and 0.1 s were recorded for packet sizes of up to 25 Kbytes. Using this information, we analyzed the best speedup that can be achieved for differing packet loss probabilities. The extent to which these measurements are indicative of the Internet in general remains to be investigated, though it is reasonable to suggest that a large scale, shared grid system will exhibit similar behavior.

In section 2, we introduce stochastic models to analyze the impact of packet loss on speedup; section 3 explains how we derived the number of packet copies to use to maximize the speedup; section 4 analyzes the speedup of parallel computation when only lost packets are retransmitted; section 5 discusses some related work on traditional systems and on WAN systems; and in section 6 we summarize our conclusions and future work.

II. The approach 

This section introduces two approaches used to evaluate the performance of parallel algorithms that use UDP for communication. We begin with a conceptual approach and then move to a more realistic BSP [12] based model that reflects the effect of packet loss. In our analysis, we use the notion of sending multiple copies of the same packet so that the probability of packet loss approaches zero.

Consider sending a packet of data between two nodes with a fixed loss probability $p$. We assume that the loss probabilities of data packets and acknowledgment packets are identical. Three scenarios can occur, as depicted in Fig. 4: i) The data packet sent by the sender is successfully received by the receiving node, and the acknowledgment packet sent by the receiver is successfully received by the sender. This happens with probability $(1-p)^2$. ii) The data packet is successfully received by the receiver, but the acknowledgment packet sent by the receiver is lost, with probability $(1-p)p$. iii) The data packet sent by the sender is lost, with probability $p$.

The probability of successful delivery is thus $(1-p)^2$. Let $c(n)$ be the total number of packets transmitted for a particular communication primitive involving $n$ processors. The probability of successfully delivering $c(n)$ data packets and receiving their acknowledgment packets is $p_s(n,p) = (1-p)^{2c(n)}$, and the probability of at least one packet loss is $p_f(n,p) = 1 - p_s(n,p)$. Ideally, as $n \to \infty$ we have maximum capacity for speedup in computing; however, $p_f(n,p) \to 1$ and thus the system fails to operate. We therefore consider the best $n$ to use for a particular communication complexity $c(n)$.
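As a small numerical illustration of how quickly the failure probability grows with the number of packets, the following sketch evaluates $p_s(n,p) = (1-p)^{2c(n)}$ and $p_f(n,p) = 1 - p_s(n,p)$. The function names and the example values are illustrative assumptions, not part of the original work.

```python
def p_success(c_n: int, p: float) -> float:
    """Probability that c_n data packets and their c_n acknowledgments all arrive.

    Assumes, as in the text, that every data or acknowledgment packet is
    lost independently with the same probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return (1.0 - p) ** (2 * c_n)


def p_failure(c_n: int, p: float) -> float:
    """Probability that at least one data or acknowledgment packet is lost."""
    return 1.0 - p_success(c_n, p)


if __name__ == "__main__":
    # Hypothetical example: c(n) = n packets per round with 10% loss.
    for n in (1, 10, 100):
        print(n, p_success(n, 0.10), p_failure(n, 0.10))
```

With a 10% loss rate, a round of 100 packets almost certainly contains at least one loss, which is why the number of nodes cannot be increased indefinitely.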
Fig. 1

Average packet loss between pairs of nodes for UDP based transmission within PlanetLab.

Fig. 2

Average bandwidth between pairs of nodes for UDP based transmission within PlanetLab.

Fig. 3

Average round trip time between pairs of nodes for UDP based transmission within PlanetLab.

Fig. 4

Different packet loss scenarios for data packet and acknowledgment packet.

Fig. 5

Computation, w, and communication, c(n), are performed for r rounds.



A. The conceptual approach 

In this section we introduce a simplified conceptual notion in which the cost of communication between computing nodes is zero (an ideal environment for parallel computing). This approach is similar to the PRAM model, which provided the impetus for better parallel computing models. It is not suitable for practical purposes, but it helps in understanding the theoretical approach. Computation and communication are performed for $r$ rounds (see Fig. 5). The sending node only knows whether a round has failed (i.e. at least one packet was lost in the round). When a round has failed, the computation $w$ ($w$ is measured in seconds of work on a processor) is performed again and the $c(n)$ packets are retransmitted. Computation is repeated here to impose some penalty for the packet losses. The conceptual approach is used to predict the best number of computing nodes to use when the communication cost is assumed to be negligible. Let $T(1) = wr$ represent the time taken to perform the computation on a single node; then $T(n) = \frac{wr}{n}$ is the time taken to perform the same computation on $n$ processors. Since $(1 - p_s(n,p))^i \, p_s(n,p)$ is the probability that $i$ attempts fail and the $(i+1)$-th attempt succeeds, we have:

$$\rho = \sum_{i=0}^{\infty} (i+1)\, p_f(n,p)^i \, p_s(n,p) = \frac{1}{p_s(n,p)}. \tag{1}$$

Here $\rho$ is the expected number of times all the packets are transmitted. Therefore the expected time taken on $n$ processors with packet loss probability $p$ is $T(n,p) = \frac{w}{n}\rho r = \frac{wr}{n\,p_s(n,p)}$, the total time taken to complete the computation on $n$ processors. Hence an expected speedup of $S_E = \frac{T(1)}{T(n,p)} = n\,p_s(n,p)$ can be achieved. Using this expected speedup for different communication $c(n)$, we can determine the optimal number of nodes, $n$, by solving $\frac{dS_E}{dn} = 0$ for $n$.
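The conceptual model is easy to check numerically. The sketch below, under the stated assumptions of independent losses and $p_s(n,p) = (1-p)^{2c(n)}$, sums the series for the expected number of attempts and searches for the integer $n$ that maximizes $S_E = n\,p_s(n,p)$. The helper names, the truncation length and the search bound `n_max` are illustrative choices, not taken from the paper.

```python
def p_success(c_n: int, p: float) -> float:
    """p_s(n, p) = (1 - p)^(2 c(n)): all data packets and acks of a round arrive."""
    return (1.0 - p) ** (2 * c_n)


def expected_attempts(c_n: int, p: float, terms: int = 10_000) -> float:
    """Truncated series sum_{i>=0} (i + 1) * p_f^i * p_s; requires p < 1."""
    p_s = p_success(c_n, p)
    p_f = 1.0 - p_s
    return sum((i + 1) * p_f ** i * p_s for i in range(terms))


def conceptual_speedup(n: int, p: float, c) -> float:
    """Expected speedup S_E = n * p_s(n, p) when communication time is ignored."""
    return n * p_success(c(n), p)


def best_n(p: float, c, n_max: int = 1024) -> int:
    """Integer n in [1, n_max] with the largest conceptual speedup."""
    return max(range(1, n_max + 1), key=lambda n: conceptual_speedup(n, p, c))


if __name__ == "__main__":
    # Hypothetical example: linear communication c(n) = n with 5% packet loss.
    print(best_n(0.05, lambda n: n))
```

For the closed form, the truncated series should agree with $1/p_s(n,p)$ to within floating-point error whenever $p < 1$.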

If $k \ge 2$ copies of the same packet are sent in each round, the probability of success increases and is given by $p_s^k(n,p) = (1-p^k)^{2c(n)}$. This approach requires $kc(n)$ packets to be transmitted from the sending nodes. For $0 < p < 1$ it is better than $k = 1$, which can be shown as follows:

$$\begin{aligned} p &> p^k, \\ (1-p) &< (1-p^k), \\ (1-p)^{2c(n)} &< (1-p^k)^{2c(n)}, \\ p_s(n,p) &< p_s^k(n,p). \end{aligned} \tag{2}$$
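A short worked example may make the gain concrete. The values $p = 0.1$, $c(n) = 4$ and $k = 2$ are chosen purely for illustration and do not come from the measurements:

```latex
% Illustrative values only: p = 0.1, c(n) = 4, k = 2
\begin{align*}
p_s(n,p)   &= (1-p)^{2c(n)}   = 0.9^{8}  \approx 0.4305, \\
p_s^k(n,p) &= (1-p^k)^{2c(n)} = 0.99^{8} \approx 0.9227.
\end{align*}
```

Doubling the number of packets sent therefore roughly doubles the probability that the round succeeds in this example.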



Transmitting $k$ copies of the same packet produces a higher probability of transmission success. The techniques used by the high performance data transfer protocols mentioned in the previous section can be applied to reduce congestion; their effect on the model is not considered here. Fig. 7 shows that when communication is constant, e.g. $c(n) = 1$ (a single point-to-point communication in a round), the speedup $S_E = n(1-p^k)^2$ increases linearly with $n$. When $c(n) = \log_2(n)$ (e.g. the binomial tree, Bruck and recursive doubling [13] algorithms for broadcast), the speedup is $O(n^{1+2\log_2(1-p^k)})$, which increases monotonically with $n$ as long as the exponent is positive. However, when $c(n) = \log_2^2(n)$, $c(n) = n$ (e.g. the Van de Geijn [13] algorithm for broadcast and the ring method for all-gather collective communication), $c(n) = n\log_2(n)$ or $c(n) = n^2$ (e.g. a naive all-to-all algorithm), the speedup function is not monotonic and there exists an optimal value of $n$ for a given $p$. The optimal value indicates the number of nodes $n$ to use when the communication cost is zero. It also shows the scalability of an algorithm for different types of communication and varying packet loss probability, as depicted in Fig. 7.

The conceptual approach used in this section can be simplified further by using:

$$\lim_{h \to \infty} \left(1 - \frac{1}{h}\right)^h = \frac{1}{e}.$$

The probability of success $p_s^k(n,p) = (1-p^k)^{2c(n)}$ can then be approximated as $p_s^k(n,p) \approx e^{-2p^k c(n)}$. When communication is $c(n) = \log_2(n)$, $c(n) = \log_2^2(n)$, $c(n) = n$, $c(n) = n\log_2(n)$ or $c(n) = n^2$, the expected speedup can be approximated as $S_E \approx n\,e^{-2p^k c(n)}$ when $p$ is small. There exists an optimal value of $n$ for a given $p$ when communication $c(n)$ is not $1$ or $\log_2(n)$. For $c(n) = \log_2^2(n)$, $c(n) = n$ and $c(n) = n^2$, the optimal number of processors to use is $\lfloor e^{(\ln 2)^2/(4p^k)} \rfloor$, $\lfloor \frac{1}{2p^k} \rfloor$ and $\lfloor \frac{1}{2\sqrt{p^k}} \rfloor$ respectively. When $c(n) = n\log_2(n)$, no analytical solution exists but a numerical solution can be found.

Fig. 6

Retransmission of packets using UDP in a BSP type parallel program in a superstep.

Fig. 7

Graph depicts speedup achieved for communication c(n) at different packet loss probabilities versus number of processing nodes with k = 2 for the conceptual approach.
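The stationary points of the approximate speedup $n\,e^{-2p^k c(n)}$ can be checked numerically. The sketch below assumes the closed forms obtained by setting the derivative with respect to $n$ to zero, namely $e^{(\ln 2)^2/(4p^k)}$ for $c(n) = \log_2^2(n)$, $1/(2p^k)$ for $c(n) = n$ and $1/(2\sqrt{p^k})$ for $c(n) = n^2$, and treats $n$ as a continuous variable. The function names and the value `pk = 0.01` are illustrative assumptions.

```python
import math


def approx_speedup(n: float, pk: float, c) -> float:
    """Approximate conceptual speedup n * exp(-2 * p^k * c(n))."""
    return n * math.exp(-2.0 * pk * c(n))


def optimum_log2_squared(pk: float) -> float:
    """Stationary point for c(n) = log2(n)^2 (continuous n)."""
    return math.exp(math.log(2.0) ** 2 / (4.0 * pk))


def optimum_linear(pk: float) -> float:
    """Stationary point for c(n) = n (continuous n)."""
    return 1.0 / (2.0 * pk)


def optimum_quadratic(pk: float) -> float:
    """Stationary point for c(n) = n^2 (continuous n)."""
    return 1.0 / (2.0 * math.sqrt(pk))


if __name__ == "__main__":
    pk = 0.01  # hypothetical effective loss, e.g. p = 0.1 with k = 2 copies
    cases = [
        (optimum_log2_squared, lambda n: math.log2(n) ** 2),
        (optimum_linear, lambda n: n),
        (optimum_quadratic, lambda n: n * n),
    ]
    for optimum, c in cases:
        n_star = optimum(pk)
        print(optimum.__name__, n_star, approx_speedup(n_star, pk, c))
```

In practice the number of nodes is an integer, so the floor of each value (or its neighbour) would be compared directly.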

III. The Lossy BSP (L-BSP) model

In this section we introduce a model that better reflects the behavior of parallel programs on a VLSG. This model, called the Lossy BSP (L-BSP), includes important characteristics of the Internet such as average end-to-end round-trip time and average end-to-end bandwidth. We fix $2\tau$ as the timeout period for sending data packets to, and receiving acknowledgment packets from, the destination (refer to Fig. 6). $\tau$ is defined as:

$$\tau = \frac{c(n)}{n}\,\alpha + \beta,$$

where $\alpha = \frac{\text{packet size}}{\text{bandwidth}}$ and the delay $\beta$ is the round-trip time (which includes the cost of sending the data packet and receiving the acknowledgment packet). $T(1) = wr$ is the time taken for computation on one processor, with $r$ the number of times the computation is repeated (known as supersteps in the BSP model), whereas $T_R = \frac{w}{n} + 2\tau$ is the expected time that $n$ processors take to complete a single round, as shown in Fig. 6. With zero packet loss (i.e. $\rho = 1$), a time of $T(n,\tau) = \frac{wr}{n} + 2r\tau$ for computation and communication on $n$ nodes can be achieved. Ideally, we expect a speedup of $S_I = \frac{T(1)}{T(n)} = n$, assuming zero communication cost (i.e. $\tau = 0$). However, with an expected average number of transmissions $\rho = \frac{1}{p_s(n,p)}$ (all packets are retransmitted if a packet is lost), the expected time taken for computation and communication is $T(n,p,\tau) = \frac{wr}{n} + 2r\tau\rho = \frac{wr}{n} + \frac{2r\tau}{p_s(n,p)}$. Hence, the expected speedup achievable is:

$$S_E = \frac{T(1)}{T(n,p,\tau)} = \frac{wr}{\frac{wr}{n} + \frac{2r\tau}{p_s(n,p)}} = \frac{w}{\frac{w}{n} + \frac{2\tau}{p_s(n,p)}} = \frac{n\,G\,p_s(n,p)}{1 + G\,p_s(n,p)},$$

where the granularity (the ratio of computation to communication cost) is $G = \frac{w}{2n\tau}$.
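The L-BSP speedup for the case in which all packets are retransmitted after a loss can be sketched as follows. The code assumes $\tau = \frac{c(n)}{n}\alpha + \beta$, $p_s(n,p) = (1-p)^{2c(n)}$ and $G = \frac{w}{2n\tau}$ as defined above; the function name and the example parameter values are hypothetical.

```python
def lbsp_speedup(n: int, w: float, c_n: int, p: float,
                 alpha: float, beta: float) -> float:
    """Expected L-BSP speedup n*G*p_s / (1 + G*p_s) when a lost packet
    causes all c_n packets of the round to be retransmitted.

    n      number of nodes
    w      work per round, in seconds on one processor
    c_n    number of packets per round, c(n)
    p      packet loss probability
    alpha  packet size / bandwidth, in seconds
    beta   round-trip delay, in seconds
    """
    tau = (c_n / n) * alpha + beta          # half of the timeout period
    p_s = (1.0 - p) ** (2 * c_n)            # success probability of a round
    g = w / (2.0 * n * tau)                 # granularity
    return n * g * p_s / (1.0 + g * p_s)


if __name__ == "__main__":
    # Hypothetical example: 16 nodes, 4 hours of work, c(n) = n, 5% loss.
    print(lbsp_speedup(16, 14400.0, 16, 0.05, 0.0037, 0.069))
```

The closed form is algebraically identical to $w / (\frac{w}{n} + \frac{2\tau}{p_s})$, which gives a convenient cross-check.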

If only lost packets are retransmitted, data packets that have been sent successfully are not transmitted again. The sequence of packet transmissions is therefore $c(n),\ pc(n),\ p^2c(n),\ p^3c(n), \ldots$ That is, in the first transmission $c(n)$ packets are sent, in the second transmission only the packets that were lost (i.e. $pc(n)$) are retransmitted, and so on. The notion used in (1) can also be applied in this model. If $p_s = (1-p)^2$ is the probability of successful delivery of a single packet and $c(n)$ is the number of packets transmitted, then the probability that the communication of a single packet terminates in the $i$-th transmission is $(1-p_s)^{i-1}p_s$. Thus $\sum_{j=1}^{i}(1-p_s)^{j-1}p_s$ is the probability that a communication terminates by the $i$-th transmission, and $\left[\sum_{j=1}^{i}(1-p_s)^{j-1}p_s\right]^{c(n)}$ is the probability that all communications terminate by the $i$-th transmission. It follows that the average number of transmissions can be obtained from:

$$\rho(p_s,c(n)) = \sum_{i=1}^{\infty} i \left( \left[\sum_{j=1}^{i}(1-p_s)^{j-1}p_s\right]^{c(n)} - \left[\sum_{j=1}^{i-1}(1-p_s)^{j-1}p_s\right]^{c(n)} \right). \tag{3}$$

The value of $\rho(p_s,c(n))$ depends on the packet loss $p$ and the communication $c(n)$. We use a numerical approach to obtain values of $\rho(p_s,c(n))$ for different numbers of nodes. The expected speedup obtained using the L-BSP model can thus be simplified as:

$$S_E = \frac{nG}{G + \rho(p_s,c(n))}, \tag{4}$$

with granularity $G = \frac{w}{2n\tau}$. Clearly, speedup approaches linearity when $G \gg \rho(p_s,c(n))$. Fig. 8 shows the effect of granularity for different communication $c(n)$. For higher communication complexity such as $c(n) = n\log_2(n)$ and $c(n) = n^2$, Fig. 8(e) and Fig. 8(f) respectively, speedup deteriorates at a faster rate.

Fig. 9 depicts the limits of speedup for different packet loss probabilities when different numbers of nodes are used. It shows that higher speedup can be achieved when packet loss is lower. The optimal number of nodes to use depends on operating conditions such as computational and communication complexity. When packet loss is higher, speedup deteriorates at a faster rate. The figure also demonstrates the important effect of granularity on speedup. Linear speedup is possible with higher granularity and the correct number of nodes, even for higher degrees of communication complexity and packet loss (e.g. n = 2).
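One possible way to evaluate the average number of transmissions numerically, when only lost packets are retransmitted, is sketched below. It uses the fact that the inner geometric sums collapse to $1 - (1-p_s)^i$, so the probability that all $c(n)$ packets are delivered by the $i$-th transmission is $[1 - (1-p_s)^i]^{c(n)}$. The truncation tolerance, the iteration cap and the function names are assumptions made for this sketch; the paper does not state which numerical method was used.

```python
def avg_transmissions(p_s: float, c_n: int, tol: float = 1e-12,
                      max_iter: int = 1_000_000) -> float:
    """Average number of transmissions until all c_n packets are delivered.

    p_s is the per-packet success probability (data and ack both arrive).
    Sums i * (F(i) - F(i-1)) with F(i) = (1 - (1 - p_s)^i)^c_n and stops
    once the remaining probability mass is below tol.
    """
    if not 0.0 < p_s <= 1.0:
        raise ValueError("p_s must lie in (0, 1]")
    q_pow = 1.0        # (1 - p_s)^i, updated incrementally
    prev_cdf = 0.0     # F(i - 1)
    total = 0.0
    for i in range(1, max_iter + 1):
        q_pow *= 1.0 - p_s
        cdf = (1.0 - q_pow) ** c_n
        total += i * (cdf - prev_cdf)
        prev_cdf = cdf
        if 1.0 - cdf < tol:
            break
    return total


def lbsp_speedup_selective(n: int, w: float, c_n: int, p: float,
                           alpha: float, beta: float) -> float:
    """Speedup n*G / (G + rho) with G = w / (2*n*tau); requires p < 1."""
    tau = (c_n / n) * alpha + beta
    g = w / (2.0 * n * tau)
    rho = avg_transmissions((1.0 - p) ** 2, c_n)
    return n * g / (g + rho)


if __name__ == "__main__":
    # Hypothetical example: 5% loss, c(n) = n = 512, 4 hours of work.
    print(avg_transmissions(0.95 ** 2, 512))
    print(lbsp_speedup_selective(512, 14400.0, 512, 0.05, 0.0037, 0.069))
```

For a single packet the result reduces to the geometric mean $1/p_s$, and it grows slowly (roughly logarithmically) with the number of packets.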



IV. Optimal packet copies 

The conceptual approach revealed that speedup can be increased by transmitting multiple copies of the same packet. It is easy to verify that as $k \to \infty$ we have $p_s^k \to 1$; however, it is unrealistic to send that many packet copies. Thus, finding the optimal value of $k$ is necessary for given values of $p$, $n$, $w$ and $c(n)$.

We consider this scenario in the L-BSP model. If $k$ copies of the same packet are sent, (4) can be rewritten as:

$$S_E = \frac{nG_k}{G_k + \rho^k(p_s^k,c(n))}, \tag{5}$$

with $p_s^k = (1-p^k)^2$, $\rho^k(p_s^k,c(n))$ the average number of transmissions when $k$ packet copies are used, and $G_k = \frac{w}{2n\tau_k}$, where $\tau_k$ is defined as $\tau_k = k\,\frac{c(n)}{n}\,\alpha + \beta$ and $2\tau_k$ represents the timeout value for sending $kc(n)$ packets. Equation (5) can be simplified to:

$$S_E = \frac{n}{1 + \frac{2k\rho^k c(n)\alpha}{w} + \frac{2n\beta\rho^k}{w}}. \tag{6}$$

Assuming communication $c(n) = n^2$, it is clear that the second term in the denominator, $\frac{2k\rho^k n^2 \alpha}{w}$, grows quadratically as $n$ increases in (6). Using the numerically solved value of $\rho^k$, we find the optimal value of $k$ by minimizing the product $k\rho^k$. Table I shows the dominating term that affects the speedup as $n \to \infty$ for different communication $c(n)$. For lower communication complexity, such as $c(n) = 1$, $\log_2(n)$ and $\log_2^2(n)$, the dominating term as $n$ increases is $\frac{2n\beta\rho^k}{w}$.

As the time to transmit the packet, $\alpha$, approaches zero (i.e. transmission cost approaches zero), (6) can be reduced to:

$$\lim_{\alpha \to 0} S_E = \frac{n}{1 + \frac{2n\beta\rho^k}{w}}.$$

Here $\rho^k \to 1$ as the number of packet copies transmitted, $k$, increases. This indicates that the work performed on each node should be large compared to the average delay between nodes to achieve good speedup.
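The search for the number of copies can be sketched as a simple scan over $k$ that minimizes the product $k\rho^k$, together with an evaluation of the speedup $n / (1 + \frac{2k\rho^k c(n)\alpha}{w} + \frac{2n\beta\rho^k}{w})$. The numerical routine for $\rho^k$, the bound `k_max` and all function names are assumptions of this sketch; it takes $p_s^k = (1-p^k)^2$ and requires $p < 1$.

```python
def avg_transmissions(p_s: float, c_n: int, tol: float = 1e-12,
                      max_iter: int = 1_000_000) -> float:
    """Average number of transmissions until all c_n packets are delivered,
    assuming only lost packets are retransmitted (truncated series)."""
    if not 0.0 < p_s <= 1.0:
        raise ValueError("p_s must lie in (0, 1]")
    q_pow, prev_cdf, total = 1.0, 0.0, 0.0
    for i in range(1, max_iter + 1):
        q_pow *= 1.0 - p_s
        cdf = (1.0 - q_pow) ** c_n
        total += i * (cdf - prev_cdf)
        prev_cdf = cdf
        if 1.0 - cdf < tol:
            break
    return total


def rho_k(p: float, c_n: int, k: int) -> float:
    """Average number of transmissions when k copies of each packet are sent."""
    return avg_transmissions((1.0 - p ** k) ** 2, c_n)


def best_k(p: float, c_n: int, k_max: int = 16) -> int:
    """Number of copies in [1, k_max] that minimizes k * rho^k."""
    return min(range(1, k_max + 1), key=lambda k: k * rho_k(p, c_n, k))


def speedup_k(n: int, w: float, c_n: int, p: float, k: int,
              alpha: float, beta: float) -> float:
    """Speedup with k packet copies, using the simplified denominator form."""
    rho = rho_k(p, c_n, k)
    return n / (1.0 + 2.0 * k * rho * c_n * alpha / w + 2.0 * n * beta * rho / w)


if __name__ == "__main__":
    # Hypothetical example: 10% loss and c(n) = n = 4096 packets per round.
    k = best_k(0.10, 4096)
    print(k, speedup_k(4096, 36000.0, 4096, 0.10, k, 0.0037, 0.069))
```

With no packet loss a single copy is always selected, since extra copies only add transmission cost.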

Fig. 10 shows how speedup is affected by the number of packet copies transmitted for work of w = 10 hours. It is clear from the figure that for communication $c(n) = n$, $c(n) = n\log_2(n)$ and $c(n) = n^2$, speedup deteriorates as the number of packet copies increases. This observation concurs with the expected decrease in speedup caused by the higher overhead of more packets and higher communication complexity.

Fig. 11 and Fig. 12 show predicted speedup with different workloads for different packet loss probabilities. These graphs are for n = 2 and n = 131072 nodes respectively, with k = 1. As the amount of work on each processor increases, granularity becomes higher and speedup approaches the total number of processors used.

V. Adapting fundamental parallel algorithms

In this section, we analyze some fundamental algorithms using the L-BSP model, which provides the number of packet copies, k, the number of nodes, n, and the amount of work, w, to use



IEEE TRANSACTION ON PARALLEL AND DISTRIBUTED SYSTEMS , VOL. -,N0.. -, MONTH - 



Fig. 8

Graph depicts speedup achieved for different numbers of nodes n, work W = 4 hours (14400 s) and communication c(n) at different packet loss probabilities (p = 0.01, 0.05, 0.10, 0.50, 0.90) with k = 1 for the L-BSP model. Panels: (a) $c(n) = 1$, (b) $c(n) = \log_2(n)$, (c) $c(n) = \log_2^2(n)$, (d) $c(n) = n$, (e) $c(n) = n\log_2(n)$, (f) $c(n) = n^2$. Horizontal axis: number of nodes.



TABLE I

Dominating term for speedup using different types of communication.

Case   Communication c(n)    Dominating term as $n \to \infty$
I      $n^2$                 $2k\rho^k c(n)\alpha / w$
II     $n\log_2(n)$          $2k\rho^k c(n)\alpha / w$
III    $n$                   $2k\rho^k c(n)\alpha / w + 2n\beta\rho^k / w$
IV     $\log_2^2(n)$         $2n\beta\rho^k / w$
V      $\log_2(n)$           $2n\beta\rho^k / w$
VI     $1$                   $2n\beta\rho^k / w$

depending on packet loss probability, p, and communication complexity, c(n).

The L-BSP model in this paper considers sending data that fits into a single packet. However, the maximum packet size in Internet Protocol version 4 (IPv4) is only 65 KB. To accommodate larger data we can: a) assume the use of Internet Protocol version 6 (IPv6), which provides a maximum packet size of up to 4 GB; or b) use multiple communication supersteps, $\gamma$, where $\gamma = \left\lceil \frac{\text{data size}}{\text{packet size}} \right\rceil$.

Since it was not possible for us to obtain values for packet loss probability, round-trip time and bandwidth using IPv6, these values are extrapolated from our experiment using IPv4 on PlanetLab.
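The number of communication supersteps needed for data larger than one packet is a ceiling division. A minimal sketch follows; the function name is hypothetical and sizes are assumed to be given in bytes.

```python
def communication_supersteps(data_size: int, packet_size: int) -> int:
    """gamma = ceil(data_size / packet_size), computed with integer arithmetic
    to avoid floating-point rounding for very large sizes."""
    if data_size < 0 or packet_size <= 0:
        raise ValueError("data_size must be >= 0 and packet_size > 0")
    return -(-data_size // packet_size)


if __name__ == "__main__":
    # Hypothetical example: 1 MiB of data sent in 64 KiB packets.
    print(communication_supersteps(2 ** 20, 2 ** 16))
```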

A. Matrix multiplication (Direct implementation) 

Consider the product of two matrices $A$ and $B$ of size $N \times N$, producing a matrix $C$ of size $N \times N$, where $N = 2^m$ and $m \in \mathbb{Z}^+$. Each processor, $k$, with $k \in \{1, 2, \ldots, P\}$ where $P$ is the total number of processors, has two submatrices of size $\frac{N}{\sqrt{P}} \times \frac{N}{\sqrt{P}}$, one from each matrix (i.e. $A$ and $B$), which are indexed as $A_{i,j}$ and $B_{i,j}$, where $i,j \in \{1, 2, \ldots, \sqrt{P}\}$ denote rows and
Fig. 9

Graph depicts speedup for different packet loss probabilities given n nodes (n = 2, 512, 4096, 32768, 65536, 131072), work W = 10 hours (36000 s), k = 1 and communication c(n) for the L-BSP model. Panels: (a) $c(n) = 1$, (b) $c(n) = \log_2(n)$, (c) $c(n) = \log_2^2(n)$, (d) $c(n) = n$, (e) $c(n) = n\log_2(n)$, (f) $c(n) = n^2$. Horizontal axis: packet loss probability.



TABLE II

Approximate speedup of parallel algorithms for different parameter values using the L-BSP model.

Algorithm                                  Matrix            Bitonic       2D-FFT      Laplace
                                           multiplication,   merge sort                equation,
                                           size (N x N)                                size (N x N)
Size, N                                    2^15              2^31          2^34        2^18
No. of processors, n                       2^16              2^17          2^15        2^17
Size of message (bytes)                    2^16              2^16          2^8         2^4
Packet size (bytes)                        2^16              2^16          2^8         2^4
Packet copies, k                           7                 6             3           5
Bandwidth (MBytes/s)                       17.5              17.5          17.07       24
Packet loss probability, p                 0.045             0.045         0.0005      0.0005
alpha = packet size / bandwidth            0.0037            0.0037        0.000015    0.000001
Delay, beta                                0.069             0.069         0.05        0.05
Average no. of transmissions, rho^k        1.025             1.002         1.24        1.0
Sequential compute time (seconds), w_s     140765.34         133.14        5841.15     23364.44
Communication cost (seconds)               27.54             28.18         7.35        1.7
Total time in parallel (seconds)           29.69             28.194        7.55        1.8783
Communication complexity, c(n)                               O(n)          O(n^2)      O(n)
Average processor performance (GFLOPS)     0.5               0.5           0.5         0.5
Speedup, S_E                               4740.89           4.72          773.4       12439.43
Efficiency                                 0.072             0.000036      0.02        0.095
Fig. 10

Graph depicts speedup for different numbers of packet copies given n = 131072 nodes, work W = 10 hours (36000 s) and communication c(n) at different packet loss probabilities (p = 0.01, 0.05, 0.10, 0.50, 0.90) for the L-BSP model. Panels: (a) $c(n) = 1$, (b) $c(n) = \log_2(n)$, (c) $c(n) = \log_2^2(n)$, (d) $c(n) = n$, (e) $c(n) = n\log_2(n)$, (f) $c(n) = n^2$. Horizontal axis: number of packet copies.
Fig. 13

Parallel matrix multiplication on P = 16 nodes.



columns respectively. Each submatrix contains $\frac{N^2}{P}$ elements. We assume that each processor in the system holds a submatrix of $A$ and a submatrix of $B$. Squares with the same number in Fig. 13 denote submatrices $C_{i,j}$ that can be computed concurrently. During the communication phase, $c(P) = 2(P^2 - P)$ packets are injected into the network. At the end of the computation each processor has a portion of the matrix $C$ in its possession. On a single processor the cost of computing is $2N^3 - N^2$. Using $P$ processors, the cost of sending submatrices from different processors to compute submatrix $C_{i,j}$, as shown in Fig. 13, is $2\gamma\rho^k(2(\sqrt{P}-1)k\alpha + \beta)$ seconds. The cost of computation is $\frac{2N^3 - N^2}{P}$ FLOPS with the L-BSP model. Thus the expected speedup is:

$$S_E = \frac{w_s}{w_p + 2\gamma\rho^k(2(\sqrt{P}-1)k\alpha + \beta)},$$

with

$$w_s = \frac{2N^3 - N^2}{\text{Average FLOPS}}, \qquad w_p = \frac{2N^3 - N^2}{P \times \text{Average FLOPS}}.$$

Using this model, we analyzed the achievable speedup for different numbers of nodes $P = 2^s$, where $s = 1, 2, 3, \ldots, 17$, and for different matrix dimensions $N \times N$, where $N = 2^{11}, 2^{12}, 2^{13}, 2^{14}, 2^{15}$. The best speedup of 4740.89 is obtained when $N = 2^{15}$ and $P = 2^{16}$; Table II shows the algorithm parameters for this speedup. The analysis shows that the matrix multiplication algorithm is very well suited to parallelization on a VLSG with our approach. Although the efficiency is low at 0.072, it is interesting to note that the problem can be solved almost 4741 times faster using $2^{16}$ nodes compared to on
Fig. 11

Graph depicts speedup for different work sizes for n = 2 and communication c(n) at different packet loss probabilities (p = 0.01, 0.05, 0.10, 0.50, 0.90) for the L-BSP model. Panels: (a) $c(n) = 1$, (b) $c(n) = \log_2(n)$, (c) $c(n) = \log_2^2(n)$, (d) $c(n) = n$, (e) $c(n) = n\log_2(n)$, (f) $c(n) = n^2$. Horizontal axis: work (seconds).



a single processor. 
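The matrix multiplication estimate can be reproduced approximately with a few lines of code. The sketch below assumes the speedup $w_s / (w_p + 2\gamma\rho^k(2(\sqrt{P}-1)k\alpha + \beta))$ with $w_s = (2N^3 - N^2)/\text{FLOPS}$ and $w_p = w_s / P$, and plugs in the Table II values for this algorithm ($N = 2^{15}$, $P = 2^{16}$, $k = 7$, $\rho^k = 1.025$, packet size $2^{16}$ bytes, bandwidth 17.5 MBytes/s, $\beta = 0.069$, 0.5 GFLOPS). Taking $\gamma = 1$ and 1 MByte = $10^6$ bytes are assumptions of this sketch, as is the function name.

```python
import math


def matmul_speedup(N: int, P: int, k: int, rho_k: float, alpha: float,
                   beta: float, flops: float, gamma: int = 1) -> float:
    """Expected L-BSP speedup of the direct parallel matrix multiplication.

    N      matrix dimension (N x N)
    P      number of processors (must be a perfect square)
    k      number of packet copies
    rho_k  average number of transmissions with k copies
    alpha  packet size / bandwidth, in seconds
    beta   round-trip delay, in seconds
    flops  average processor performance in FLOPS
    gamma  number of communication supersteps (assumed 1 here)
    """
    root_p = math.isqrt(P)
    if root_p * root_p != P:
        raise ValueError("P must be a perfect square")
    operations = 2 * N ** 3 - N ** 2
    w_s = operations / flops                  # sequential compute time
    w_p = operations / (P * flops)            # compute time per processor
    comm = 2 * gamma * rho_k * (2 * (root_p - 1) * k * alpha + beta)
    return w_s / (w_p + comm)


if __name__ == "__main__":
    alpha = 2 ** 16 / 17.5e6   # packet size / bandwidth, assuming 1 MB = 1e6 bytes
    print(matmul_speedup(2 ** 15, 2 ** 16, 7, 1.025, alpha, 0.069, 0.5e9))
```

With these inputs the result lands within about one percent of the tabulated speedup; the small difference is consistent with rounding of the tabulated parameters.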

B. Sorting (Bitonic mergesort) 

Here, we analyze the complexity of Batcher's bitonic sort
algorithm [14]. Assuming each processor in the system has
$\frac{N}{P}$ unsorted keys, the task is to rearrange them so
that every key in processor i is less than or equal to every
key in processor i + 1, where $1 \le i < P$. First, each
processor sorts its $\frac{N}{P}$ keys locally (in either
ascending or descending order, to obtain a bitonic sequence).
Then the algorithm performs $\log_2(P)$ merge stages, as shown
in Fig. 14, where stage S ($1 \le S \le \log_2(P)$) has S merge
steps. In merge step j ($1 \le j \le S$) of stage S, each
processor i sends the list of $\frac{N}{P}$ keys in its
possession to processor x (x is obtained by complementing the
jth bit of i). Thereafter, every processor does the merging and
keeps either the first half or the second half of the merged
list. At the end of sorting, each processor i will have
$\frac{N}{P}$ sorted keys that are less than or equal to every
key in processor i + 1 [15]. A total of
$\frac{\log_2(P)(\log_2(P)+1)}{2}$ steps are required in this
algorithm. In each step, a total of $c(P) = P$ UDP packets are
transmitted. The computational cost of sorting an unsorted
sequence of size N and merging in parallel is
$\frac{N}{P}\log_2\left(\frac{N}{P}\right) + \frac{\log_2(P)(\log_2(P)+1)}{2}\left(\frac{2N}{P} - 1\right)$
FLOPS, and the total communication cost is
$\frac{\gamma}{2}\log_2(P)(\log_2(P)+1)(k\alpha + \beta)\rho^{k}$
seconds. The expected speedup can be calculated using the
L-BSP model as:

$$S_E = \frac{W_s}{W_p + \frac{\gamma}{2}\log_2(P)(\log_2(P)+1)(k\alpha + \beta)\rho^{k}}$$

with

$$W_s = \frac{N\log_2 N}{\text{Average FLOPS}}$$

and

$$W_p = \frac{\frac{N}{P}\log_2\left(\frac{N}{P}\right) + \frac{\log_2(P)(\log_2(P)+1)}{2}\left(\frac{2N}{P} - 1\right)}{\text{Average FLOPS}}.$$

With this model, the achievable speedup for different data
sizes, $N = 2^{26}, 2^{27}, 2^{28}, 2^{29}, 2^{30}, 2^{31}$,
and different numbers of nodes $P = 2^{s}$, where
$s = 1, 2, 3, \ldots, 17$, was analyzed. The best speedup
obtained was 4.722; Table II shows the algorithm parameters
for this speedup. Although efficiency is very low for this
algorithm, it is interesting to observe that some speedup can
still be obtained on VLSG with our approach.
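
The communication pattern of the merge phase can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes P is a power of two, that processors are numbered from 0, and that bit j = 1 denotes the least significant bit. The function names are hypothetical.

```python
def log2_exact(p):
    """Return log2(p) for a power of two p >= 1; reject other values."""
    if p < 1 or p & (p - 1):
        raise ValueError("p must be a power of two")
    return p.bit_length() - 1


def merge_schedule(p):
    """List the (stage, step) pairs: stage S has merge steps j = 1..S."""
    stages = log2_exact(p)
    return [(s, j) for s in range(1, stages + 1) for j in range(1, s + 1)]


def partner(i, j):
    """Processor exchanging keys with i in step j: flip the j-th bit of i.

    Assumption of this sketch: j = 1 is the least significant bit.
    """
    return i ^ (1 << (j - 1))


if __name__ == "__main__":
    p = 8
    schedule = merge_schedule(p)
    # The number of steps equals log2(P) * (log2(P) + 1) / 2.
    print(len(schedule))
    for stage, step in schedule:
        pairs = sorted({tuple(sorted((i, partner(i, step)))) for i in range(p)})
        print(stage, step, pairs)
```

Because flipping a bit twice returns the original index, the pairing is symmetric: in every step each processor sends one list and receives one list, which matches the P packets per step assumed in the analysis.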

C. 2D Fast Fourier Transform (FFT) transpose method (FFT-TM)

The Fourier transform plays an important role in many
scientific and technical applications. A fast Fourier
transform (FFT) has a RAM complexity of $O(N \log N)$. The FFT
can be parallelized using the 2D FFT-TM algorithm. The 2D
FFT-TM algorithm can be viewed as computing multiple 1D FFTs
in each direction using a fast FFT library (e.g. FFTW),
together with a couple of all-to-all communications for
inter-processor data distribution. During this communication,
each node sends a portion of its $\frac{N}{P}$ data (i.e.
$\frac{N}{P^{2}}$ data) to $P - 1$ processors. The received
data (complex numbers with a datum size of 16 bytes) are then
rearranged to complete the transpose process. We assume the
cost of rearranging the data to be insignificant. In our
analysis, every processor initially holds $\frac{N}{P}$ data,
and only one UDP data packet is required to send a portion of
its data to another node. A total of $c(P) = P(P - 1)$ UDP
data packets, whose size depends on the data size b, are
transmitted during the all-to-all data distribution. The total
parallel computation cost for this algorithm is given by
$10\frac{N}{P}\log(\sqrt{N})$ FLOPs and the communication cost
is $4\gamma\rho^{k}\left(k\alpha(P - 1) + \beta\right)$
seconds. The expected speedup can then be calculated using the
L-BSP model as:

$$S_E = \frac{W_s}{W_p + 4\gamma\rho^{k}\left(k\alpha(P - 1) + \beta\right)}$$

with

$$W_s = \frac{5N\log(N)}{\text{Average FLOPS}} \quad \text{and} \quad W_p = \frac{10N\log(\sqrt{N})}{P \cdot \text{Average FLOPS}}.$$

Fig. 12

Graph depicts speedup for different work sizes, for
n = 131072 and k = 1, communication c(n) at different packet
loss probabilities for the L-BSP model. Panels:
(a) $c(n) = 1$, (b) $c(n) = \log_2 n$,
(c) $c(n) = \log_2^{2} n$, (d) $c(n) = n$,
(e) $c(n) = n \log_2 n$, (f) $c(n) = n^{2}$.
Horizontal axis: work (seconds).

Fig. 14

Parallel bitonic mergesort for P nodes.

Using this model, speedup for different numbers of nodes
$P = 2^{s}$, where $s = 1, 2, 3, \ldots, 15$, and for
different data sizes N, where
$N = 2^{30}, 2^{32}, 2^{34}, 2^{36}, 2^{38}$, was analyzed. A
best speedup of 773.4 with an efficiency of 0.02 is obtained
when $N = 2^{38}$ and $P = 2^{15}$. Table II shows the
algorithm parameters for this speedup. The analysis shows that
the 2D-FFT algorithm may be suitable for parallelization on
VLSG with our approach for very large data sizes.
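
The structure of the transpose method can be illustrated on a single machine. The sketch below is not the authors' code and does not use FFTW: it assumes power-of-two dimensions, uses a plain recursive radix-2 FFT, and replaces the all-to-all exchange by an in-memory transpose. The names `fft`, `transpose` and `fft2_transpose` are illustrative.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT of a sequence whose length is a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def transpose(m):
    """Transpose a matrix stored as a list of rows.

    In the parallel algorithm this is the all-to-all communication step.
    """
    return [list(col) for col in zip(*m)]


def fft2_transpose(m):
    """2D FFT by the transpose method: row FFTs, transpose, row FFTs."""
    rows = [fft(r) for r in m]                 # 1D FFTs in the first direction
    cols = [fft(r) for r in transpose(rows)]   # 1D FFTs in the second direction
    return transpose(cols)                     # restore the original layout


if __name__ == "__main__":
    for row in fft2_transpose([[1, 2], [3, 4]]):
        print(row)
```

In a distributed setting each node would hold a block of rows, so only the two `transpose` calls require communication; the row FFTs are purely local.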

D. Partial differential equation: Laplace's equation

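Partial differential equations are used in many different
fields of computational science to model real-world phenomena.
One of the fundamental partial differential equations is
Laplace's equation:

$$\frac{\partial^{2} f}{\partial x^{2}} + \frac{\partial^{2} f}{\partial y^{2}} = 0,$$

$$f(0,y) = U_1(y), \quad f(1,y) = U_2(y),$$
$$f(x,0) = U_3(x), \quad f(x,1) = U_4(x),$$
$$0 \le x \le 1, \quad 0 \le y \le 1.$$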

The solution $f(x,y)$ for this equation with the given boundary
conditions can be found using the finite difference method.
First the Laplace equation must be discretized, and the
resulting system of linear equations is then solved. In
matrix-vector form, this system of linear equations has a
sparse matrix with 5 nonzero diagonals. In this paper, we
analyze the Jacobi method used to solve this system. We can
represent the function $f(x,y)$ by its values at a discrete set
of uniformly spaced mesh points, $x_i = i\Delta x$ and
$y_j = j\Delta y$ for $i = 0, 1, \ldots, m$ and
$j = 0, 1, \ldots, q$, with $\Delta x = \frac{1}{m}$ and
$\Delta y = \frac{1}{q}$. The grid sizes h of the x-dimension
and y-dimension are chosen to be equal,
$h = \Delta x = \Delta y$, to simplify our analysis. The
function $f(x,y)$ at any point $(x_i, y_j)$ is represented by
$f_{i,j}$. The finite difference representation for the Laplace
equation is,

$$\frac{f_{i+1,j} - 2f_{i,j} + f_{i-1,j}}{h^{2}} + \frac{f_{i,j+1} - 2f_{i,j} + f_{i,j-1}}{h^{2}} = 0$$

and can be rearranged as,

$$f^{k+1}_{i,j} = \frac{1}{4}\left[f^{k}_{i+1,j} + f^{k}_{i-1,j} + f^{k}_{i,j+1} + f^{k}_{i,j-1}\right],$$

where $f^{k+1}$ is the value obtained from the (k+1)th
iteration and $f^{k}$ is the value obtained from the kth
iteration. The (i,j)th point is computed from the ijth
(product of i and j) equation. This results in a system of
linear equations with $(m - 1)^{2}$ equations and $(m - 1)^{2}$
unknowns. The Jacobi method can be used to solve this system
and is described as:

$$x_i^{(k+1)} = \frac{1}{a_{ii}}\left[b_i - \sum_{j \ne i} a_{ij}\,x_j^{(k)}\right].$$

For a pentadiagonal system with $\frac{(m-1)^{2}}{P} > 5$ (P is
the number of processors used), each node is required to
exchange at most 3 newly calculated values of unknowns between
neighboring nodes. Thus, $c(P) = 2(P - 1)$ packets of size
$3b$ bytes, where b is the data size in bytes, are injected
into the network at any one communication. For a diagonally
dominant matrix, which is the case for the Laplace equation,
the Jacobi method will converge to "a good solution" in
$\log_2 P$ steps; however, this depends on the initial value
used and the convergence rate. In our analysis, we take
$\log_2 P$ as the number of rounds required for convergence.

The total parallel computation cost for this algorithm is
$2d\log_2 P\left(\frac{(m-1)^{2}}{P}\right)$ FLOPs, where d is
the number of diagonals in the matrix (d = 5 for a
pentadiagonal system). The total communication time is
$2\log_2 P\,\rho^{k}\left(\alpha k\frac{2(P-1)}{P} + \beta\right)$
seconds. The expected speedup can then be calculated using the
L-BSP model as:

$$S_E = \frac{W_s}{W_p + 2\rho^{k}\log_2 P\left(k\alpha\frac{2(P-1)}{P} + \beta\right)}$$

with

$$W_s = \frac{2d\log_2 P\,(m-1)^{2}}{\text{Average FLOPS}} \quad \text{and} \quad W_p = \frac{2d\log_2 P\,(m-1)^{2}}{P \cdot \text{Average FLOPS}}.$$



Fig. 15

Each node computes $\frac{(m-1)^{2}}{P}$ points and exchanges
at most 3 newly computed values in a pentadiagonal system.



With this model, the achievable speedup for five different
dimensions $m \times m$, with m a power of two, and different
numbers of nodes $P = 2^{s}$, where $s = 1, 2, 3, \ldots, 17$,
was analyzed. The best speedup of 12439.43 was obtained with
$P = 2^{17}$ nodes; Table II shows the algorithm parameters for
this speedup. Although efficiency is low at 0.095 for this
algorithm, it is interesting to observe that reasonable speedup
can still be obtained on VLSG with our approach.
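
The Jacobi update for the discretized Laplace equation can be sketched on a single node as follows. This is a minimal illustration under stated assumptions: the grid is square with spacing h = 1/m, the boundary values are supplied by a caller-provided function, and the number of sweeps is chosen freely by the caller (the analysis above instead takes $\log_2 P$ rounds). The function names are hypothetical.

```python
def jacobi_step(f):
    """One Jacobi sweep: each interior point becomes the mean of its
    four neighbours from the previous iterate; boundary values are kept."""
    m = len(f) - 1
    new = [row[:] for row in f]
    for i in range(1, m):
        for j in range(1, m):
            new[i][j] = 0.25 * (f[i + 1][j] + f[i - 1][j]
                                + f[i][j + 1] + f[i][j - 1])
    return new


def solve_laplace(m, boundary, sweeps):
    """Approximate f on an (m+1) x (m+1) mesh over the unit square.

    boundary(x, y) gives the prescribed value on the edges; the
    (m-1)^2 interior unknowns start at zero (an arbitrary choice).
    """
    f = [[0.0] * (m + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(m + 1):
            if i in (0, m) or j in (0, m):
                f[i][j] = boundary(i / m, j / m)
    for _ in range(sweeps):
        f = jacobi_step(f)
    return f


if __name__ == "__main__":
    # f(x, y) = x satisfies Laplace's equation, so the interior
    # values should approach x_i = i / m.
    grid = solve_laplace(4, lambda x, y: x, 200)
    print(grid[2][2])
```

In the parallel setting each node would own a block of the mesh and exchange only the boundary values of its block with its neighbours after every sweep.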

E. Broadcast 

The broadcast operation is a fundamental primitive in many
parallel applications. A broadcast is a communication pattern
where a source node sends messages to all other processors in
the system. Two commonly used algorithms are the binomial
tree for short messages and the one proposed by Van de
Geijn [16] for long messages. In the binomial tree algorithm,
the root node $P_0$ sends data to node $P_{0+P/2}$. These nodes
then act as the new roots within their own subtrees and
recursively distribute the messages. This communication takes
a total of $\lceil \log P \rceil$ steps. In the following
analysis, we consider messages that fit into a single packet.
In step $\log P$, $c(P) = \log P$ packets are communicated.
Thus, the total cost of communication in the L-BSP model can
be simplified as:



ibcast 



^(l_2r'°g^l-i)+/3RogPl 
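
The binomial tree pattern described in this section can be sketched as a send schedule. This is an illustrative sketch and not the authors' implementation: it assumes the root is node 0, that each current root hands the upper half of its range to a new root, and that for node counts that are not powers of two the sender keeps the larger half. The function name is hypothetical.

```python
import math


def binomial_broadcast(p):
    """Return the send schedule as a list of steps.

    Each step is a list of (sender, receiver) pairs. Every node that
    already holds the message forwards it to the first node of the
    upper half of the range it is responsible for.
    """
    ranges = [(0, p)]  # (root, number of nodes in its range, root included)
    steps = []
    while any(size > 1 for _, size in ranges):
        sends, next_ranges = [], []
        for root, size in ranges:
            keep = (size + 1) // 2          # nodes that stay with this root
            next_ranges.append((root, keep))
            if size > 1:
                child = root + keep          # new root of the upper half
                sends.append((root, child))
                next_ranges.append((child, size - keep))
        steps.append(sends)
        ranges = next_ranges
    return steps


if __name__ == "__main__":
    schedule = binomial_broadcast(8)
    for number, sends in enumerate(schedule, start=1):
        print(number, sends)
    print(len(schedule) == math.ceil(math.log2(8)))
```

The number of senders doubles from one step to the next, so all P nodes are reached after $\lceil \log_2 P \rceil$ steps with P - 1 messages in total.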



F. All-gather 

The all-gather is an operation where data from different
processors are gathered on all processors. Three different
algorithms can be used for this operation: the ring method,
recursive doubling and the Bruck algorithm. In the ring method,
each processor i sends its portion of data to processor i + 1
and receives data from processor i - 1 (with wrap-around).
In the next step, each processor i forwards to processor i + 1
the data it received from i - 1 in the previous step. These
steps are repeated $P - 1$ times, where P is the number of
processors. If N is the total amount of data to be gathered
on each processor, then at every step each processor sends
$\frac{N}{P}$ data, and over the $P - 1$ steps it receives the
data from the $P - 1$ other processors [15]. Here we consider
data sizes that fit into a single packet. Thus, the total cost
of communication for the ring method in the L-BSP model is,

$$t_{allgather} = [k\alpha + \beta](P - 1)\rho^{k}.$$
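
The ring method can be simulated in a few lines to confirm that P - 1 forwarding steps are enough. This is a minimal single-process sketch with hypothetical names; it assumes processors are numbered from 0 and models each message as one block, in line with the single-packet assumption above.

```python
def ring_allgather(blocks):
    """Simulate the ring all-gather for P = len(blocks) processors.

    Processor i starts with blocks[i]. In every step it sends the block
    it obtained most recently to processor (i + 1) mod P and receives
    one block from processor (i - 1) mod P.
    """
    p = len(blocks)
    gathered = [{i: blocks[i]} for i in range(p)]
    to_send = list(range(p))  # index of the block each processor sends next
    for _ in range(p - 1):
        received = [None] * p
        for i in range(p):
            received[(i + 1) % p] = to_send[i]
        for i in range(p):
            idx = received[i]
            gathered[i][idx] = blocks[idx]
            to_send[i] = idx  # forward this block in the next step
    return [[g[j] for j in range(p)] for g in gathered]


if __name__ == "__main__":
    for row in ring_allgather(["a", "b", "c", "d"]):
        print(row)
```

Every processor sends exactly one block per step, which is why the cost expression grows linearly with P - 1.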

VI. Related work 

Many computational models have been developed for parallel
architectures, each of which tries to reflect the behavior of
parallel algorithms running on a parallel architecture. As
stated by Maggs et al. [17], models should balance simplicity
with accuracy, abstraction with practicality, and descriptivity
with prescriptivity. Early models such as PRAM [18] and
variants that address PRAM's weaknesses (e.g. Phase
PRAM [19], APRAM [20], LPRAM [21], and BPRAM [22])
have been developed for parallel architectures. Other models
such as the Postal model [23], BSP [12] and LogP [24], which
consider communication costs such as network latency and
bandwidth, were developed to better reflect the behavior of
parallel algorithms. Variants of BSP such as D-BSP [25], [26],
E-BSP [27] and CGM [28], [29], [30], [31], and models that
take the memory hierarchy into consideration, such as the
Parallel Hierarchical Memory model (P-HMM) [32], LogP-HMM
and LogP-UMH [33], have also been developed. However,
there is no single model that has become a standard choice
for the parallel computing community. This is due to the
heterogeneity in communication topology and architecture.
Computational modeling for grid platforms is still in its
infancy, and not many models have been developed for this
platform yet. The well-known computational models for grids
are the k-Heterogeneous Bulk Synchronous Parallel
($HBSP^{k}$) [34], Dynamic BSP [35] and BSPGRID [36]. Although
some of these models do consider communication and network
latency, they do not consider packet loss as a fundamental
parameter. This could be because most of these models assume
the use of TCP for internode communication, where other
factors far outweigh the impact of packet losses. In our
approach, we used UDP as the communication protocol. It is
well known that packet loss and the congestion control
mechanism (such as rate-based congestion control) are the main
contributors to UDP's performance. In this paper we
concentrate on the impact of packet loss on performance and
the use of multiple packet copies to improve performance.

VII. Conclusion 

Providing an accurate model to reflect the behavior of a
sequential program running on a single computer is difficult
due to current technologies in computer architecture. On grids,
it becomes even harder as there are many more factors that
influence the behavior of computing resources and the network.

Experiments that run parallel programs on PlanetLab indicate
that the communication phase between nodes on a WAN is the
main bottleneck affecting the performance of parallel
programs, even when computing resources are not highly
loaded by other jobs. Thus, it is necessary to utilize the
fastest available protocol (UDP) to execute parallel programs
on grids. The weakness of UDP can be remedied by using a
lightweight mechanism for reliability to enhance achievable
speedup.

In this paper, a new model based on the BSP model that
considers packet loss probability as a fundamental parameter
is introduced. We measured average packet loss, round trip
time, and bandwidth for UDP between pairs of nodes within
PlanetLab to better understand the dynamics of a WAN. This
information is then used in our model to derive the optimal
number of packet copies to use in order to maximize the
speedup of parallel programs. The effect of packet loss on
the performance of parallel programs is shown.

We also analyzed a few fundamental algorithms using the L-BSP
model. Although the efficiency is very low in some cases,
the results show that it is possible to obtain some speedup
when a large number of nodes is used. It is also important to
note that the results do not incorporate the effect of the
memory hierarchy.

In future work, other features such as replication of parallel
programs for fault tolerance and reliability are being
considered. We intend to evaluate the performance of parallel
programs based on the L-BSP model and a detailed packet loss
model [37] for TCP.